Effect of Instruction Fetch and Memory Scheduling on GPU Performance
Authors
Abstract
GPUs are massively multithreaded architectures designed to exploit data-level parallelism in applications. Instruction fetch and the memory system are two key components in the design of a GPU. In this paper, we study the effect of the fetch policy and the memory system on the performance of GPU kernels by varying the fetch and memory scheduling policies. As part of our analysis, we categorize applications as symmetric or asymmetric based on the instruction lengths of warps. Our analysis shows that for symmetric applications, fairness-based fetch and DRAM policies can improve performance. However, asymmetric applications require more sophisticated policies.
Related resources
Petri Net Analysis of Non-Redundant and Redundant Execution Schemes
The quest for high performance has led to multi- and many-core systems. To push the performance of a single core to the limit, simultaneous multithreading (SMT) is used. SMT enables fetching instructions from different threads, hiding latencies in other threads. SMT also gives the opportunity to execute redundant threads (redundant multithreading, RMT) and thus to detect faults by compa...
Effective Instruction Prefetching In Chip Multiprocessors
threaded application performance, often achieved through instruction level parallelism per chip is increasing, the software and hardware techniques to exploit the potential of studies mostly involve distributed shared memory multiprocessors and fetching will not be fully effective at masking the remote fetch latency. the effective address of the load instructions along that path based upon a hi...
Trace Cache Performance
The instruction fetch mechanism is a performance bottleneck of a superscalar processor. Fetch performance can be improved with the aid of an instruction memory known as a trace cache. This paper presents analytical expressions that describe the instruction fetch performance of a trace cache microarchitecture. The instruction fetch rates predicted by the expressions differ by seven percent from the si...
A Compiler-driven Supercomputer
The overall performance of supercomputers is slow compared to the speed of their underlying logic technology. This discrepancy is due to several bottlenecks: memories are slower than the CPU, conditional jumps limit the usefulness of pipelining and pre-fetching mechanisms, and functional-unit parallelism is limited by the speed of hardware scheduling. This paper describes a supercomputer archit...
A Graph-based Model for GPU Caching Problems
A GPU is a massively parallel computational accelerator that is equipped with hundreds or thousands of cores. A single GPU can provide more than 4 teraflops of single-precision performance at its peak; however, the maximum memory throughput of a GPU card is around 200 GB/s. Such a gap usually prevents the GPU's computational power from being fully harnessed. A cache is a layer in between GPU’s computati...